-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add availability-recovery from systematic chunks #1644
Add availability-recovery from systematic chunks #1644
Conversation
Signed-off-by: alindima <[email protected]>
…atic-chunks-av-recovery
…atic-chunks-av-recovery
…vailability-recovery-strategies
this avoids sorting the chunks for systematic recovery, which is the hotpath we aim to optimise
when querying the runtime api
…vailability-recovery-strategies
…ecovery-strategies' into alindima/add-systematic-chunks-av-recovery
The CI pipeline was cancelled due to failure one of the required jobs. |
…atic-chunks-av-recovery
c07b56d
to
25dc880
Compare
this is needed in order to trigger chunk requests because PoVs smaller than 4Mibs will prefer fetching from backers
…atic-chunks-av-recovery
We should still explore switching to https://github.com/AndersTrier/reed-solomon-simd since it seems much faster than our own reed-solomn crate.
We could explore relaxing this restriction somewhat, after we know how things respond to this change in the real networks. Also this 1 depends somewhat upon the number of validators.
Just a nit pick, we still need all approval checkers to rerun the encoding to check the encoding merkle root, so while the decoding becomes almost nothing, the overall availability system remains expensive. |
I don't really think we overload the backers, but we do try backers as first option, and if that fails we try systematic chunks. Going back to backers doesn't really makes much sense in this setup. |
noted: #605 (comment) |
Current strategy is to try backers first if pov size is higher than 1 Mib. Otherwise, try systematic recovery (if feature is enabled). During systematic recovery, if a few validators did not give us the right chunk, we try to retrieve these chunks from the backers also (we only try each backer once). This is so that we may still do systematic recovery if one or a few validators are not responding
Of course. I discovered though that decoding is much slower than encoding. This PR, coupled with some other optimisations I did in our reed-solomon code still enable us to drop the CPU consumption for erasure coding from ~20% to 5% (as measured on versi, on a network with 50 validators and 10 glutton parachains). |
* master: (93 commits) Fix broken windows build (#4636) Beefy client generic on aduthority Id (#1816) pallet-staking: Put tests behind `cfg(debug_assertions)` (#4620) Broker new price adapter (#4521) Change `XcmDryRunApi::dry_run_extrinsic` to take a call instead (#4621) Update README.md (#4623) Publish `chain-spec-builder` (#4518) Add omni bencher & chain-spec-builder bins to release (#4557) Moves runtime macro out of experimental flag (#4249) Filter workspace dependencies in the templates (#4599) parachain-inherent: Make `para_id` more prominent (#4555) Add metric to measure the time it takes to gather enough assignments (#4587) Improve On_demand_assigner events (#4339) Conditional `required` checks (#4544) [CI] Deny adding git deps (#4572) [subsytem-bench] Remove redundant banchmark_name param (#4540) Add availability-recovery from systematic chunks (#1644) Remove workspace lints from templates (#4598) `sc-chain-spec`: deprecated code removed (#4410) [subsystem-benchmarks] Add statement-distribution benchmarks (#3863) ...
**Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged** Fixes paritytech#598 Also implements [RFC paritytech#47](polkadot-fellows/RFCs#47) ## Description - Availability-recovery now first attempts to request the systematic chunks for large POVs (which are the first ~n/3 chunks, which can recover the full data without doing the costly reed-solomon decoding process). This has a fallback of recovering from all chunks, if for some reason the process fails. Additionally, backers are also used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, to not overload them). - Quite obviously, recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster as measured on my apple M2 Pro). - Introduces a `ValidatorIndex` -> `ChunkIndex` mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47. - The mapping is feature-gated by the [NodeFeatures runtime API](paritytech#2177) so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping will be the identity mapping and backwards-compatibility will be preserved. - Adds a new chunk request protocol version (v2), which adds the ChunkIndex to the response. This may or may not be checked against the expected chunk index. For av-distribution and systematic recovery, this will be checked, but for regular recovery, no. This is backwards compatible. First, a v2 request is attempted. If that fails during protocol negotiation, v1 is used. - Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before. ## Performance results Some results from subsystem-bench: with regular chunk recovery: CPU usage per block 39.82s with recovery from backers: CPU usage per block 16.03s with systematic recovery: CPU usage per block 19.07s End-to-end results here: paritytech#598 (comment) #### TODO: - [x] [RFC paritytech#47](polkadot-fellows/RFCs#47) - [x] merge paritytech#2177 and rebase on top of those changes - [x] merge paritytech#2771 and rebase - [x] add tests - [x] preliminary performance measure on Versi: see paritytech#598 (comment) - [x] Rewrite the implementer's guide documentation - [x] paritytech#3065 - [x] paritytech/zombienet#1705 and fix zombienet tests - [x] security audit - [x] final versi test and performance measure --------- Signed-off-by: alindima <[email protected]> Co-authored-by: Javier Viola <[email protected]>
**Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged** Fixes paritytech#598 Also implements [RFC paritytech#47](polkadot-fellows/RFCs#47) ## Description - Availability-recovery now first attempts to request the systematic chunks for large POVs (which are the first ~n/3 chunks, which can recover the full data without doing the costly reed-solomon decoding process). This has a fallback of recovering from all chunks, if for some reason the process fails. Additionally, backers are also used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, to not overload them). - Quite obviously, recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster as measured on my apple M2 Pro). - Introduces a `ValidatorIndex` -> `ChunkIndex` mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47. - The mapping is feature-gated by the [NodeFeatures runtime API](paritytech#2177) so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping will be the identity mapping and backwards-compatibility will be preserved. - Adds a new chunk request protocol version (v2), which adds the ChunkIndex to the response. This may or may not be checked against the expected chunk index. For av-distribution and systematic recovery, this will be checked, but for regular recovery, no. This is backwards compatible. First, a v2 request is attempted. If that fails during protocol negotiation, v1 is used. - Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before. ## Performance results Some results from subsystem-bench: with regular chunk recovery: CPU usage per block 39.82s with recovery from backers: CPU usage per block 16.03s with systematic recovery: CPU usage per block 19.07s End-to-end results here: paritytech#598 (comment) #### TODO: - [x] [RFC paritytech#47](polkadot-fellows/RFCs#47) - [x] merge paritytech#2177 and rebase on top of those changes - [x] merge paritytech#2771 and rebase - [x] add tests - [x] preliminary performance measure on Versi: see paritytech#598 (comment) - [x] Rewrite the implementer's guide documentation - [x] paritytech#3065 - [x] paritytech/zombienet#1705 and fix zombienet tests - [x] security audit - [x] final versi test and performance measure --------- Signed-off-by: alindima <[email protected]> Co-authored-by: Javier Viola <[email protected]>
Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged
Fixes #598
Also implements RFC #47
Description
ValidatorIndex
->ChunkIndex
mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47.Performance results
Some results from subsystem-bench:
with regular chunk recovery: CPU usage per block 39.82s
with recovery from backers: CPU usage per block 16.03s
with systematic recovery: CPU usage per block 19.07s
End-to-end results here: #598 (comment)
TODO: